Abstract

Blind Compressed Sensing (BCS) is an extension of Compressed Sensing (CS) where the optimal sparsifying 
dictionary is assumed to be unknown and subject to estimation (in addition to the CS sparse coefficients). Since 
the emergence of BCS, dictionary learning, a.k.a. sparse coding, has been studied as a matrix factorization problem 
where its sample complexity, uniqueness and identifiability have been addressed thoroughly. However, in spite of 
the strong connections between BCS and sparse coding, recent results from the sparse coding problem area have 
not been exploited within the context of BCS. In particular, prior BCS efforts have focused on learning constrained 
and complete dictionaries that limit the scope and utility of these efforts. In this paper, we develop new theoretical 
bounds for perfect recovery for the general unconstrained BCS problem. These unconstrained BCS bounds cover 
the case of overcomplete dictionaries, and hence, they go well beyond the existing BCS theory. Our perfect recovery 
results integrate the combinatorial theories of sparse coding with some of the recent results from low-rank matrix 
recovery. In particular, we propose an efficient CS measurement scheme that results in practical recovery bounds 
for BCS. Moreover, we discuss the performance of BCS under polynomial-time sparse coding algorithms. 


I. Introduction 

The sparse representation problem involves solving the system of linear equations $y = Ax \in \mathbb{R}^d$ where $x \in \mathbb{R}^m$
is assumed to be $k$-sparse; i.e. $x$ is allowed to have (at most) $k$ non-zero entries. The matrix $A \in \mathbb{R}^{d \times m}$ is typically
referred to as the dictionary with $m > d$ elements or atoms. It is well-known that $x$ can be uniquely identified if
$A$ satisfies the so-called spark condition$^1$. Meanwhile, there exist tractable and efficient convex relaxations of the
combinatorial problem of finding the (unique) $k$-sparse solution of $y = Ax$ with provable recovery guarantees [1].
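
The spark condition can be checked directly on small dictionaries. The following is a minimal brute-force sketch (exponential in the number of atoms, so only useful for toy sizes); the function names `spark` and `satisfies_spark_condition` and the tolerance are illustrative assumptions, not part of the original development.

```python
from itertools import combinations

import numpy as np


def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (brute force)."""
    m = A.shape[1]
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            # A subset is dependent when its rank is below its cardinality.
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < size:
                return size
    return m + 1  # convention used here: no dependent subset exists


def satisfies_spark_condition(A, k):
    """True when every 2k columns of A are linearly independent."""
    return spark(A) > 2 * k
```

For example, three vectors in the plane that are pairwise independent have spark 3, so the condition holds for $k = 1$ only.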

A related problem is dictionary learning or sparse coding [2] which can be expressed as a sparse factorization
[3] of the data matrix $Y = AX$ (where both $A$ and $X \in \mathbb{R}^{m \times n}$ are assumed unknown) given that each column
of $X$ is $k$-sparse and $A$ satisfies the spark condition as before. A crucial question is how many data samples ($n$)
are needed to uniquely identify $A$ and $X$ from $Y$. Unfortunately, the existing lower bound is (at best) exponential,
$n \ge (k+1)\binom{m}{k}$, assuming an equal number of data samples over each $k$-sparse support pattern in $X$ [4], [5].

In this paper, we address a more challenging problem. In particular, we are interested in the above sparse
matrix factorization problem $Y = AX$ (with both sparsity and spark conditions) when only $p < d$ random linear
measurements from each column of $Y$ are available. We would like to find lower bounds for $n$ for the (partially
observed) matrix factorization to be unique. This problem can also be seen as recovering both the dictionary $A$ and
the sparse coefficients $X$ from compressive measurements of data. For this reason, this problem has been termed
Blind Compressed Sensing (BCS) before [6], although the end-goal of BCS is the recovery of $Y$.

Summary of Contributions We start by establishing that the uniqueness of the learned dictionary over random 
data measurements is a sufficient condition for the success of BCS. Perfect recovery conditions for BCS are derived 
under two different scenarios. In the first scenario, fewer random linear measurements are available from each data 
sample. It is stated that having access to a large number of data samples compensates for the inadequacy of 
sample-wise measurements. Meanwhile, in the second scenario, it is assumed that slightly more random linear 
measurements are available over each data sample and the measurements are partly fixed and partly varying over 
the data. This measurement scheme results in a significant reduction in the required number of data samples for 
perfect recovery. Finally, we address the computational aspects of BCS based on the recent non-iterative dictionary 
learning algorithms with provable convergence guarantees to the generating dictionary. 

Mohammad Aghagolzadeh and Hayder Radha are with Department of Electrical and Computer Engineering, Michigan State University, 
East Lansing, MI, USA. Email: aghagoll@msu.edu and radha@msu.edu. 

$^1$ That is, every $2k < d$ columns of $A$ are linearly independent.


A. Prior Art on BCS 

BCS was initially proposed in [6] where it was assumed that, for a given random Gaussian sampling matrix
$\Phi \in \mathbb{R}^{p \times d}$ ($p < d$), $Z = \Phi Y$ is observed. The conclusion was that, assuming the factorization $Y = AX$ is unique,
the factorization $Z = BX$ would also be unique with high probability when $A$ is an orthonormal basis. However,
it would be impossible to recover $A$ from $B = \Phi A$ when $p < d$. It was suggested that structural constraints
be imposed over the space of admissible dictionaries to make the inverse problem well-posed. Some of these
structures were sparse bases under known dictionaries, finite sets of bases and orthogonal block-diagonal bases [6].
While these results can be useful in many applications, some of which are mentioned in [6], they do not generalize
to unconstrained overcomplete dictionaries.

Subsequently, there has been a line of empirical work showing that dictionary learning from compressive
data (a sufficient step for BCS) can be successful given that a different sampling matrix is employed for each
data sample$^2$ (i.e. each column of $Y$). For example, [7] uses a modified K-SVD to train both the dictionary and the
sparse coefficients from the incomplete data. Meanwhile, [8], [9], [10] use generic gradient descent optimization
approaches for dictionary learning when only random projections of data are available. The empirical success of
dictionary learning with partial as well as compressive or projected data triggers more theoretical interest in finding
the uniqueness bounds of the unconstrained BCS problem.

Finally, we must mention the theoretical results presented in the pre-print [11] on BCS with overcomplete
dictionaries while $X$ is assumed to lie in a structured union of disjoint subspaces [12]. It is also proposed that the
results of this work extend to the generic sparse coding model if the 'one-block sparsity' assumption is relaxed. We
argue that the main theoretical result in this pre-print is incomplete and technically flawed as briefly explained here.
In the proof of Theorem 1 of [11], it is proposed that (with adjustment of notation) "assignment [of $Y$'s columns to
rank-$k$ disjoint subsets] can be done by the (admittedly impractical) procedure of testing the rank of all possible
$\binom{n}{k+1}$ matrices constructed by concatenating subsets of $k + 1$ column vectors, as assumed in [4]". However, this
ignores that the entries of $Y$ are missing at random and that the rank of an incomplete matrix cannot be measured. As
will become clearer later, the main challenge in the uniqueness analysis of unconstrained BCS is in addressing
this particular issue. Two strategies to tackle this issue that are presented in this paper are: 1) increasing the number 
of data samples and 2) designing and employing measurement schemes that preserve the low-rank structure of Y’s 
sub-matrices. 

This paper is organized as follows. In Section II, we provide the formal problem definition for BCS. Our main
results are presented in Section III. We present the proofs in Section IV. Practical aspects of BCS are treated in
Section V where we explain how provable dictionary learning algorithms, such as [16], can be utilized for BCS.
Finally, we conclude the paper and present future directions in Section VI.

B. Notation 

Our general convention throughout this paper is to use capital letters for matrices and small letters for vectors and
scalars. For a matrix $X \in \mathbb{R}^{m \times n}$, $X_{ij} \in \mathbb{R}$ denotes its entry on row $i$ and column $j$, $x_i \in \mathbb{R}^m$ denotes its $i$th column
and $\mathrm{vec}(X) \in \mathbb{R}^{mn}$ denotes its column-major vectorized format. The inner product between two matrices $A$ and $B$
(of the same sizes) is defined as $\langle A, B \rangle = \mathrm{trace}(A^T B)$. Let $\mathrm{Spark}(A)$ denote the smallest number of $A$'s columns
that are linearly dependent. $A$ is $\mu$-coherent if $\forall i \ne j$ we have $\frac{|\langle a_i, a_j \rangle|}{\|a_i\|_2 \|a_j\|_2} \le \mu$. Finally, let $[m] := \{1, 2, \ldots, m\}$
and let $\binom{[m]}{k}$ denote the set of all subsets of $[m]$ of size $k$.
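
As a small illustration of the coherence notion, the sketch below computes the smallest $\mu$ for which a dictionary is $\mu$-coherent, assuming the normalized definition (absolute inner product divided by the product of the column norms). The helper name `coherence` is an illustrative assumption.

```python
import numpy as np


def coherence(A):
    """Largest normalized absolute inner product between distinct columns of A."""
    unit = A / np.linalg.norm(A, axis=0)  # normalize each column
    gram = np.abs(unit.T @ unit)          # |<a_i, a_j>| for unit-norm columns
    np.fill_diagonal(gram, 0.0)           # ignore the i == j entries
    return float(gram.max())
```

An orthonormal basis has coherence zero, while two columns at a 45 degree angle give $1/\sqrt{2}$.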

II. BCS Problem Definition 

Construct the data matrix $Y \in \mathbb{R}^{d \times n}$ by concatenating $n$ signal vectors $y_j \in \mathbb{R}^d$ (for $j$ from 1 to $n$). Throughout
this paper, we make the following assumptions about the sampling operator and the data sparsity. It must be noted
that the following assumptions over the sparse coding of $Y$ are minimal among existing sparse coding assumptions
for provable uniqueness guarantees; see e.g. [4], [5].

$^2$ Note that the linear form $Z = BX$ is no longer valid, which is possibly a reason for the lack of a theoretical
extension of BCS to this case.

Linear measurement Suppose $p < d$ linear measurements are taken from each signal $y_j \in \mathbb{R}^d$ as in $z_j = \Phi_j y_j \in \mathbb{R}^p$
where $\Phi_j \in \mathbb{R}^{p \times d}$ is referred to as the sampling matrix. We could also represent the measurements as a linear
projection of the signal onto the row-space of the sampling matrix$^3$:

$$\hat{y}_j = \Phi_j^T (\Phi_j \Phi_j^T)^{-1} \Phi_j y_j$$

We will use $\mathcal{M}^p(Y) = [z_1^T, \ldots, z_n^T]^T \in \mathbb{R}^{pn}$ to denote the observations and $\mathcal{P}^p(Y) \in \mathbb{R}^{d \times n}$ to denote the
projected matrix that is a concatenation of all $\hat{y}_j$. Specifically, when entries of each $\Phi_j$ are drawn independently
from a random Gaussian distribution with mean zero and variance $1/d$, we use the notations $\mathcal{M}^p_G(Y)$ and $\mathcal{P}^p_G(Y)$.
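
The measurement model can be sketched numerically as follows. This is an illustration under stated assumptions rather than the paper's implementation: each column gets its own Gaussian sampling matrix with variance $1/d$, and the projection onto the row space is computed as $\Phi_j^T(\Phi_j\Phi_j^T)^{-1}z_j$. The function name `measure_and_project` is hypothetical.

```python
import numpy as np


def measure_and_project(Y, p, rng):
    """Return per-column measurements z_j, projections y_hat_j and the sampling matrices."""
    d, n = Y.shape
    Z = np.empty((p, n))
    Y_hat = np.empty((d, n))
    Phis = []
    for j in range(n):
        # A fresh Gaussian sampling matrix for every column (variance 1/d).
        Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(p, d))
        z = Phi @ Y[:, j]
        # Projection of y_j onto the row space of Phi, computed from z alone.
        Y_hat[:, j] = Phi.T @ np.linalg.solve(Phi @ Phi.T, z)
        Z[:, j] = z
        Phis.append(Phi)
    return Z, Y_hat, Phis
```

Applying a sampling matrix to its projected column returns the original measurement, which is why the projected matrix and the measurements carry the same information once the sampling matrices are known.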

Sparse coding model Assume $Y = AX$ where $A \in \mathbb{R}^{d \times m}$ denotes the dictionary ($m > d$ in the overcomplete
setting) and $X \in \mathbb{R}^{m \times n}$ is a sparse matrix with exactly $k$ non-zero entries per column and $\mathrm{Spark}(A) > 2k$.
Additionally, assume that each column of $X$ is randomly drawn by first selecting its support $S \in \binom{[m]}{k}$ uniformly
at random and then filling the support entries with random i.i.d. values uniformly drawn from a bounded interval,
e.g. $(0, 1] \subset \mathbb{R}$. We denote by $\mathcal{Y}_k^m$ the set of feasible $Y$ under the described sparse coding model. Note that the
assumption $\mathrm{Spark}(A) > 2k$ is necessary to ensure a unique $X$ even when $A$ is known and fixed.

Remark As noted and proved in [5], when $Y \in \mathcal{Y}_k^m$, with probability one, no subset of $k$ (or fewer) columns of $Y$
is linearly dependent. Also with probability one, if a subset of $k + 1$ columns of $Y$ is linearly dependent, then all
of the $k + 1$ columns must have the same support.
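
The random sparse coding model and the rank property in the remark can be illustrated with a short simulation. This is a sketch under the stated model (supports drawn uniformly, values uniform on $(0, 1]$); the function name `sample_sparse_model` is an illustrative assumption.

```python
import numpy as np


def sample_sparse_model(A, k, n, rng):
    """Draw n columns Y = A X with exactly k non-zeros per column of X."""
    m = A.shape[1]
    X = np.zeros((m, n))
    supports = []
    for j in range(n):
        S = np.sort(rng.choice(m, size=k, replace=False))  # uniform support
        X[S, j] = 1.0 - rng.random(k)                       # uniform on (0, 1]
        supports.append(tuple(int(s) for s in S))
    return A @ X, X, supports
```

Grouping the generated columns by support and computing the rank of a group with at least $k + 1$ members gives rank $k$, in line with the remark.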

Given the above definitions, we can now formally express the problem definition for BCS: 

BCS problem definition Recover $Y \in \mathcal{Y}_k^m$ from $\mathcal{M}^p(Y)$ given $\mathcal{M}^p$, $m$ and $k$.

Our results throughout this paper are mainly developed for the class of Gaussian measurements $\mathcal{M}^p = \mathcal{M}^p_G$.
However, it is not difficult to extend these results to the larger class of continuous sub-Gaussian distributions for
$\mathcal{M}^p$.


III. Main Results 

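To start with, assume that there are exactly $\ell$ columns in $X$ for each support pattern $S \in \mathcal{S}$ where $\mathcal{S} = \binom{[m]}{k}$. For
better understanding and without loss of generality, one can assume that the data samples are ordered according to
the following sketch for $X$:

[Sketch: the columns of $X$ arranged in $|\mathcal{S}|$ consecutive groups of $\ell$ columns, one group per support pattern, for a total of $n = \ell|\mathcal{S}|$ samples.]
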
The best known bound for $\ell$, for the factorization $Y = AX$ to be unique (with probability one) under the specified
random sparse coding model, is $\ell \ge k + 1$. This results in an exponential sample complexity $n \ge (k + 1)\binom{m}{k}$.
Specifically, it is said that 'the factorization $Y = AX$ is unique' if, for any other feasible factorization $Y = A'X' \in \mathcal{Y}_k^m$,
there exist a diagonal matrix $D \in \mathbb{R}^{m \times m}$ and a permutation matrix $P$ such that $A' = APD$. Clearly,
this ambiguity makes it more challenging to prove the uniqueness of the dictionary learning problem. Meanwhile,
the authors in [5] propose a strategy for handling the permutation and scaling ambiguity which is reviewed in Lemma
IV.1.

Through the following lemma, we can establish that the uniqueness of the learned dictionary is a sufficient 
condition for the success of BCS (proof is provided in Appendix). 


Lemma III.1. Suppose for every pair $AX, A'X' \in \mathcal{Y}_k^m$ that satisfy $\mathcal{M}^p_G(A'X') = \mathcal{M}^p_G(AX)$ with $p > 2k$,
$A' = APD$ for some diagonal matrix $D$ and permutation matrix $P$. Then $A'X' = AX$ with probability one.


$^3$ Note that $z_j$ can be computed from $\hat{y}_j$ using the relationship $z_j = \Phi_j \hat{y}_j$. Therefore, $\hat{y}_j$ and $z_j$ carry the same amount of information
about $y_j$ given the sampling matrix $\Phi_j$.

Briefly speaking, existing uniqueness results exploit the fact that the rank of each group of columns in the above
sketch is bounded above by $k$. This makes it possible to uniquely identify groups of samples that share the same
support pattern. Meanwhile, when only $\mathcal{M}^p(Y)$ is available, it might not be possible to uniquely identify these
groups. Nevertheless, it is noted in [5] that $\ell \ge k|\mathcal{S}| = k\binom{m}{k}$ ensures uniqueness without the need for grouping, at
the cost of significantly increasing the required number of data samples (compared to $\ell \ge k + 1$).

In our initial BCS uniqueness result, we use the pigeon-hole strategy of [5] which results in a less practical
bound $n \ge k|\mathcal{S}|^2$ even when $Y$ is completely observed$^4$. Yet, it is interesting to explore the implications of a
finite $n$ that ensures a successful BCS for the general sparse coding model. The CS theory requires the complete
knowledge of $A$ to uniquely recover $X$ and $Y$ from $\mathcal{M}^p(Y)$. Meanwhile, our results assert that $A$, $X$ and $Y$ can
be uniquely identified from $\mathcal{M}^p(Y)$ given a large but finite number of samples $n$. Necessary proofs for the results
of this section are presented in the following section.

Theorem III.1. Assume $p > 2k$ and there are exactly $\ell$ columns in $X$ for each $S \in \mathcal{S}$. Then $Y \in \mathcal{Y}_k^m$ can be
perfectly recovered from $\mathcal{M}^p_G(Y)$ with probability one given that $\ell \ge \frac{2k(d-2k)+1}{p-2k}\binom{m}{k}$.

Corollary III.1. With probability at least $1 - \beta$, $Y \in \mathcal{Y}_k^m$ can be perfectly recovered from $\mathcal{M}^p_G(Y)$ given that
$p > 2k$ and $n \ge \frac{1}{\beta}\,\frac{2k(d-2k)+1}{p-2k}\binom{m}{k}^2$.

Aside from the intellectual implications of Theorem III.1 and Corollary III.1 discussed above, the stated bounds
for $\ell$ and $n$ are clearly not very practical. To reduce these bounds while guaranteeing the success of BCS, we
introduce a hybrid measurement scheme that we explain below.


A. BCS with hybrid measurements 


Definition (Hybrid Gaussian Measurement) In a hybrid measurement scheme,

$$\Phi_j = \begin{bmatrix} F \\ V_j \end{bmatrix}$$

where $F \in \mathbb{R}^{p_f \times d}$ stands for the fixed part of the sampling matrix and $V_j \in \mathbb{R}^{p_v \times d}$ stands for the varying part of the sampling matrix.
The total number of measurements per column is $p = p_f + p_v < d$. In a hybrid Gaussian measurement scheme,
$F$ and $V_1$ through $V_n$ are assumed to be drawn independently from an i.i.d. zero-mean Gaussian distribution with
variance $1/d$. The observations corresponding to $F$ and the $V_j$'s are denoted by $FY \in \mathbb{R}^{p_f \times n}$ and $\mathcal{M}^{p_v}_G(Y) \in \mathbb{R}^{p_v n}$
respectively.


As mentioned earlier, the hybrid measurement scheme was designed to reduce the required number of data
samples for perfect BCS recovery. In particular, as formalized in Lemma IV.4, the fixed part of the measurements
is designed to retain the low-rank structure of each $k$-dimensional subspace associated with a particular $S \in \mathcal{S}$.
Meanwhile, the varying part of the measurements is essential for the uniqueness of the learned dictionary.

Theorem III.2. Assume $p > 3k + 1$ and there are exactly $\ell$ columns in $X$ for each $S \in \mathcal{S}$. Then $Y \in \mathcal{Y}_k^m$ can be
perfectly recovered from hybrid Gaussian measurements with probability one given that $\ell \ge \frac{2k(d-2k)+1}{p-3k-1}$.

Remark Similar to the statement of Corollary III.1, it can be stated that BCS with hybrid Gaussian measurement
succeeds with probability at least $1 - \beta$ given that $n \ge \frac{1}{\beta}\,\frac{2k(d-2k)+1}{p-3k-1}\binom{m}{k}$. The proof follows the proof of Corollary
III.1.
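
To get a feel for the gap between the two measurement schemes, the per-support sample requirements can be evaluated numerically. The sketch below assumes the bounds as they are used in the proofs, namely $\ell \ge \frac{2k(d-2k)+1}{p-2k}\binom{m}{k}$ for purely varying Gaussian measurements and $\ell \ge \frac{2k(d-2k)+1}{p_v-2k}$ with $p_v = p-k-1$ for hybrid measurements; the function names are illustrative.

```python
from math import comb


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def ell_gaussian(d, m, k, p):
    """Columns per support pattern needed with purely varying Gaussian measurements."""
    if p <= 2 * k:
        raise ValueError("requires p > 2k")
    return ceil_div((2 * k * (d - 2 * k) + 1) * comb(m, k), p - 2 * k)


def ell_hybrid(d, k, p):
    """Columns per support pattern needed with hybrid measurements (p_f = k + 1)."""
    p_v = p - (k + 1)
    if p_v <= 2 * k:
        raise ValueError("requires p > 3k + 1")
    return ceil_div(2 * k * (d - 2 * k) + 1, p_v - 2 * k)
```

With $d = 20$, $m = 30$, $k = 2$ and $p = 10$, the first bound asks for 4713 columns per support pattern while the hybrid bound asks for 22, which shows how removing the $\binom{m}{k}$ factor from $\ell$ changes the picture.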

Remark Although we mainly follow the stochastic approach of [5] in this paper, we could also employ the
deterministic approach of [4] to arrive at the uniqueness bound in Theorem III.2. In [4], an algorithm (which is not
necessarily practical) is proposed to uniquely recover $A$ and $X$ from $Y$. This algorithm starts by finding subsets of
size $\ell$ of $Y$'s columns that are linearly dependent by testing the rank of every subset. Dismissing the degenerate
possibilities$^5$, these detected subsets would correspond to samples with the same support pattern in $X$. Under the
assumptions in Theorem III.2, it is possible to test whether $\ell$ columns in $Y$ are linearly dependent (with probability
one), as a consequence of Lemma IV.4 in the following section.


$^4$ The authors in [5] propose a deterministic approach using the pigeon-hole principle as well as a probabilistic approach with smaller bounds
for $n$.

$^5$ Degenerate instances of $X$ are dismissed by adding extra assumptions in the deterministic sparse coding model. Meanwhile, as pointed
out in [5], such degenerate instances of $X$ would have a probability measure of zero in a random sparse coding model.

Until now, our goal was to show that $A$ (and subsequently $X$) is unique given only CS measurements. As we
mentioned before, uniqueness of $A$ is a sufficient condition for the success of BCS. Consider the scenario where
not all support patterns $S \in \mathcal{S}$ are realized in $X$, or where for some patterns there are not enough samples to guarantee recovery.
For such scenarios, we present the following theorem.

Theorem III.3. Assume $p > 3k + 1$ and let

$$\hat{\mathcal{S}} = \{S \mid S \in \mathcal{S},\ |J(S)| \ge \gamma\} \subseteq \mathcal{S}$$

where $\gamma = \frac{2k(d-2k)+1}{p-3k-1}$ and $J(S)$ denotes the set of indices of columns of $X$ with support $S$. Then, under hybrid
Gaussian measurement, $Y_{J(S)}$ for all $S \in \hat{\mathcal{S}}$ can be perfectly recovered with probability one.

IV. Proofs 

The following crucial lemma from [5] handles the permutation ambiguity of sparse coding.

Lemma IV.1 ([5], Lemma 1). Assume $\mathrm{Spark}(A) > 2k$ for $A \in \mathbb{R}^{d \times m}$ and let $\mathcal{S} = \binom{[m]}{k}$. If there exists a mapping
$\pi : \mathcal{S} \to \mathcal{S}$ such that

$$\mathrm{span}\{A_S\} = \mathrm{span}\{A'_{\pi(S)}\} \quad \text{for every } S \in \mathcal{S}$$

then there exist a permutation matrix $P$ and a diagonal matrix $D$ such that $A' = APD$.

The following lemma from random matrix theory, along with Lemma IV.1, are the main ingredients of our first
main result (proof is provided in the Appendix).

Lemma IV.2. Assume $A, B \in \mathbb{R}^{d \times \ell}$ are rank-$k$ matrices and $\mathcal{M}^p_G$ is a Gaussian measurement operator with
$p \ge (2k(d + \ell - 2k) + 1)/\ell$. If $\mathcal{M}^p_G(A) = \mathcal{M}^p_G(B)$, then $A = B$ with probability one.

Proof of Theorem III.1:

Assume $A'X'$ is an alternate factorization that satisfies $A'X' \in \mathcal{Y}_k^m$ and $\mathcal{M}^p_G(A'X') = \mathcal{M}^p_G(AX)$. We will prove
$A' = APD$ for some diagonal $D$ and some permutation matrix $P$ using Lemma IV.1. Consider a particular support
pattern $S \in \mathcal{S}$ and let $J(S) \subseteq [n]$ denote the set of indices of $X$'s columns that have the sparsity pattern $S$. By
definition, $|J(S)| = \ell \ge k'\binom{m}{k}$ where $k' = (2k(d - 2k) + 1)/(p - 2k)$. Due to the pigeon-hole principle, there
must be at least $k'$ columns within $X'_{J(S)}$ that share some particular support pattern $S' \in \mathcal{S}$. In other words, if
$J'(S')$ denotes the set of indices of $X'$'s columns that have the support pattern $S'$, then $|J(S) \cap J'(S')| \ge k'$. For
simplicity, denote $I = J(S) \cap J'(S')$. Clearly, $\mathrm{rank}(AX_I) = \mathrm{rank}(A'X'_I) = k$ (because $|S| = |S'| = k$), and we
have

$$\mathcal{M}^p_G(A'X'_I) = \mathcal{M}^p_G(AX_I)$$

According to Lemma IV.2, if $p \ge (2k(d + k' - 2k) + 1)/k'$ or equivalently $k' \ge (2k(d - 2k) + 1)/(p - 2k)$, then
$A'X'_I = AX_I$ with probability one. Meanwhile, since $|I| \ge k' \ge k + 1$, $A'X'_I = AX_I$ necessitates that

$$\mathrm{span}\{A_S\} = \mathrm{span}\{A'_{S'}\} \qquad (1)$$

Finally, since $A$ satisfies the spark condition, it is not difficult to see that $\pi(S) = S'$ is a bijective map. To explain
more, assume there exists some $S'' \ne S$ such that

$$\mathrm{span}\{A_{S''}\} = \mathrm{span}\{A'_{S'}\}$$

Combining with (1) we arrive at

$$\mathrm{span}\{A_S\} = \mathrm{span}\{A_{S''}\},$$

which contradicts the spark condition for $A$ for $S'' \ne S$. Therefore, $\pi$ must be injective. Now, since $\mathcal{S}$ is a finite
set and $\pi$ is an injective mapping from $\mathcal{S}$ to itself, it must also be surjective and, thus, bijective. ■

In order to have at least $\ell$ columns in $X$ for each support $S \in \mathcal{S}$ in the random sparse coding model $\mathcal{Y}_k^m$, we
must have more than just $n = \ell|\mathcal{S}|$ data samples. The following result from [5] quantifies the number of required
data samples to ensure at least $\ell$ columns per each $S \in \mathcal{S}$ with a tunable probability of success.

Lemma IV.3 ([5], §IV). For a randomly generated $X$ with $n = \ell\binom{m}{k}$ and $\beta \in [0, 1]$, with probability at least $1 - \beta$,
there are at least $\beta\ell$ columns for each support pattern $S \in \mathcal{S}$.

Proof of Corollary III.1: The proof is fairly straightforward. According to Lemma IV.3, we need $n = \frac{\ell}{\beta}\binom{m}{k}$ samples to
guarantee that with probability at least $1 - \beta$ there are at least $\ell$ samples in $X$ for each support pattern $S \in \mathcal{S}$.
In Theorem III.1, we established that $\ell \ge \frac{2k(d-2k)+1}{p-2k}\binom{m}{k}$ guarantees the success of BCS under Gaussian sampling.
Therefore, $n \ge \frac{1}{\beta}\,\frac{2k(d-2k)+1}{p-2k}\binom{m}{k}^2$ guarantees the desired uniqueness. ■

In order to prove the results for the hybrid measurement scheme, we present the following lemma which is
proved in the Appendix.

Lemma IV.4. Assume $F \in \mathbb{R}^{p_f \times d}$ is drawn from an i.i.d. zero-mean Gaussian distribution (with $p_f < d$). Let
$Y_J \in \mathbb{R}^{d \times |J|}$ denote the columns of $Y$ indexed by the set $J$. If $\mathrm{rank}(FY_J) = k < p_f$, then $\mathrm{rank}(Y_J) = k$ with
probability one.
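
The rank-preservation property behind Lemma IV.4 is easy to observe numerically. The sketch below draws a fixed Gaussian matrix $F$ and compares the numerical rank of a low-rank block before and after applying it; the function name and the absolute tolerance are illustrative assumptions.

```python
import numpy as np


def ranks_before_and_after(Y_J, p_f, rng, tol=1e-9):
    """Return (rank(Y_J), rank(F @ Y_J)) for a fresh Gaussian F with p_f rows."""
    d = Y_J.shape[0]
    F = rng.normal(0.0, np.sqrt(1.0 / d), size=(p_f, d))
    r_full = np.linalg.matrix_rank(Y_J, tol=tol)
    r_proj = np.linalg.matrix_rank(F @ Y_J, tol=tol)
    return int(r_full), int(r_proj)
```

For a rank-$k$ block and $p_f = k + 1$ rows, both ranks agree, so the rank of the block can be read from the fixed measurements alone.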

Proof of Theorem III.2: Assume $A'X'$ is an alternate factorization that satisfies $A'X' \in \mathcal{Y}_k^m$, $\mathcal{M}^{p_v}_G(A'X') =
\mathcal{M}^{p_v}_G(AX)$ and $FA'X' = FAX$. Also assume $p_f = k + 1$ and $p_v = p - k - 1$. Consider a particular support pattern
$S' \in \mathcal{S}$ and let $J'(S') \subseteq [n]$ denote the set of indices of $X'$'s columns that have the same sparsity pattern $S'$.
Clearly,

$$\mathrm{rank}(FA'X'_{J'(S')}) \le \mathrm{rank}(A'X'_{J'(S')}) = k$$

Therefore, if $p_f \ge k + 1$, then $p_f > \mathrm{rank}(FA'X'_{J'(S')})$ and according to Lemma IV.4,

$$\mathrm{rank}(FA'X'_{J'(S')}) = \mathrm{rank}(A'X'_{J'(S')}) = k$$

with probability one. Hence, $\mathrm{rank}(FAX_{J'(S')}) = k$. Again using Lemma IV.4 with $p_f \ge k + 1$,

$$\mathrm{rank}(AX_{J'(S')}) = \mathrm{rank}(FAX_{J'(S')}) = k$$

with probability one. Therefore, all the columns in $X_{J'(S')}$ must have the same support, namely $S$. Note that since
$J'(S') \subseteq J(S)$, $|J'(S')| \le |J(S)| = \ell$. Meanwhile,

$$\sum_{S' \in \mathcal{S}} |J'(S')| = n = \ell\binom{m}{k}$$

necessitates that $|J'(S')| = \ell$ for every $S' \in \mathcal{S}$. Therefore, $|J(S) \cap J'(S')| = |I| = \ell$. Now, given

$$\mathcal{M}^{p_v}_G(A'X'_I) = \mathcal{M}^{p_v}_G(AX_I)$$

according to Lemma IV.2, if $\ell \ge (2k(d - 2k) + 1)/(p_v - 2k)$, then $A'X'_I = AX_I$ with probability one. Meanwhile,
since $|I| = \ell \ge k + 1$, $A'X'_I = AX_I$ necessitates that

$$\mathrm{span}\{A_S\} = \mathrm{span}\{A'_{S'}\} \qquad (2)$$

Finally, since $A$ satisfies the spark condition, $\pi(S) = S'$ is a bijective map and $A' = APD$ for some diagonal $D$
and permutation matrix $P$ according to Lemma IV.1. ■

Proof of Theorem III.3: Recall that for every $S \in \hat{\mathcal{S}}$ we have $|J(S)| \ge \gamma \ge k + 1$. Assume $p_f = k + 1$
and $p_v = p - k - 1$ as before. Having $p_f \ge k + 1$ allows testing whether a subset of $k + 1$ columns of $Y$ is
linearly dependent (has a rank of $k$) with probability one. Therefore, by doing an exhaustive search among every
sub-matrix $Y_J$ with $J \in \binom{[n]}{k+1}$ we are able to find subsets of $J(S)$ (of size $k + 1$) if $|J(S)| \ge k + 1$. Moreover, we
can combine and complete these subsets to uniquely identify every rank-$k$ sub-matrix $Y_{J(S)}$ with $|J(S)| \ge k + 1$.
Now, among these sub-matrices, those with $|J(S)| \ge \gamma$ can be recovered perfectly (with probability one) since,
for any rank-$k$ matrices $Y_{J(S)}$ and $\hat{Y}_{J(S)}$,

$$\mathcal{M}^{p_v}_G(\hat{Y}_{J(S)}) = \mathcal{M}^{p_v}_G(Y_{J(S)})$$

with

$$p_v \ge (2k(d + |J(S)| - 2k) + 1)/|J(S)|$$

or $|J(S)| \ge \frac{2k(d-2k)+1}{p_v - 2k}$ implies $\hat{Y}_{J(S)} = Y_{J(S)}$ according to Lemma IV.2. ■
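
The exhaustive search described in this proof can be sketched as follows: every subset of $k + 1$ columns of the fixed measurements $FY$ is tested for rank deficiency, and deficient subsets are merged into groups of columns sharing a support. This is an illustrative sketch with exponential cost, not a practical algorithm; the function name, the union-find bookkeeping and the relative tolerance are assumptions.

```python
from itertools import combinations

import numpy as np


def group_by_support(FY, k, tol=1e-8):
    """Group columns of FY (with k + 1 rows or more) that share a support pattern."""
    n = FY.shape[1]
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for J in combinations(range(n), k + 1):
        s = np.linalg.svd(FY[:, list(J)], compute_uv=False)
        if s[-1] <= tol * s[0]:  # numerically rank-deficient: rank at most k
            root = find(J[0])
            for j in J[1:]:
                parent[find(j)] = root

    groups = {}
    for j in range(n):
        groups.setdefault(find(j), []).append(j)
    # Only groups with at least k + 1 members can be detected by the rank test.
    return [g for g in groups.values() if len(g) >= k + 1]
```

Columns whose support is realized fewer than $k + 1$ times never appear in a rank-deficient subset and are therefore left ungrouped, which matches the restriction $|J(S)| \ge k + 1$ in the proof.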


V. Algorithmic Performance of BCS under Hybrid Measurements 

Recall that in the dictionary learning (DL) problem, the data matrix $Y \in \mathbb{R}^{d \times n}$ is given where $Y = A^*X^* \in \mathcal{Y}_k^m$
and the task is to factorize $Y = AX \in \mathcal{Y}_k^m$ such that $A = A^*PD$ for some permutation matrix $P$ and diagonal
matrix $D$. Unfortunately, the corresponding optimization problem is non-convex (even with $\ell_1$ relaxation). The
majority of existing DL algorithms are based on the iterative scheme of starting from an initial state $Y = A^{(0)}X^{(0)}$
and alternating between updating $X^{(t+1)}$ while keeping $A^{(t)}$ fixed and updating $A^{(t+1)}$ while keeping $X^{(t+1)}$ fixed,
each corresponding to a convex problem. It has been recently shown that if the initial dictionary is sufficiently
close$^6$ to $A^*PD$ for some $P$ and $D$, then the iterative algorithm converges to $A^*PD$ under certain incoherency
assumptions about $A^*$ [15]. Similar guarantees have been derived for the well-known K-SVD algorithm [20].

Furthermore, DL from incomplete or corrupt data has also been tackled in several studies. In particular, DL from
compressive measurements has been addressed in [7], [8], [9], [10] where different iterative DL algorithms are
modified to accommodate the compressive measurements. In some cases, these modifications have been justified
by showing that the output of each iteration does not significantly deviate from the reference output based on the
complete data. However, to the best of our knowledge, there are no convergence guarantees to $A^*PD$ for these iterative
algorithms. As we mentioned before, a successful DL from compressive measurements is a sufficient condition for a
successful BCS. In this section, we investigate the utility of a recently proposed (non-iterative) DL algorithm
[16] with guarantees for the approximate recovery of $A^*PD$ for an incoherent $A^*$. One would hope that $A^*PD$
can be approximated from $Y$ with fewer data samples than is required for the exact identification of $A^*PD$, which
was the topic of previous sections.

Below, we review the main result of [16] and analyze the performance of their DL algorithm if only hybrid
Gaussian measurements were available. Recall that in our BCS measurement scheme, $p_f$ fixed and $p_v$ varying
linear measurements are taken from each sample for a total of $p = p_f + p_v$ linear measurements (per sample).
Before presenting their result, we need to introduce some new notation as well as modifications to the sparse coding
model to reflect the model used in [16]. In particular, let $X \in \mathbb{R}^m$ denote the random vector of sparse coefficients
where its distribution class $\Gamma$ is defined below. Hence, each $x_j$ denotes an outcome of $X$. Also, let $X_i$ denote the
random variable associated with the $i$th entry of $X$.

Definition (Distribution class $\Gamma$) The distribution is in class $\Gamma$ if i) $\forall X_i \ne 0$: $X_i \in [-C, -1] \cup [1, C]$ and $\mathbb{E}[X_i] = 0$.
ii) Conditioned on any subset of coordinates in $X$ being non-zero, the values of $X_i$ are independent of each other.
The distribution has bounded $\ell$-wise moments if the probability that $X$ is non-zero in any subset $S$ of $\ell$ coordinates is
at most $c^\ell$ times $\prod_{i \in S} \mathbb{P}[X_i \ne 0]$ where $c = O(1)$.

Remark Similar to [16], in the rest of the paper we will assume $C = 1$. Derived results generalize to the case $C > 1$
by losing constant factors in guarantees.

Definition Two dictionaries $A, B \in \mathbb{R}^{d \times m}$ are column-wise $\epsilon$-close, if there exists a permutation $\pi$ and $\theta \in \{\pm 1\}^m$
such that $\forall i \in [m]$: $\|a_i - \theta_i b_{\pi(i)}\|_2 \le \epsilon$.

Remark When talking about two dictionaries $A$ and $B$ that are $\epsilon$-close, we always assume the columns are ordered
and scaled correctly so that $\|a_i - b_i\|_2 \le \epsilon$.
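
Column-wise closeness up to permutation and sign can be evaluated by brute force for small dictionaries. The sketch below tries every permutation and, for each matched pair of columns, the better of the two signs; it is an illustration for toy sizes only, and the function names are assumptions.

```python
from itertools import permutations

import numpy as np


def columnwise_distance(A, B):
    """Smallest achievable max_i ||a_i - theta_i b_pi(i)||_2 over permutations and signs."""
    m = A.shape[1]
    best = np.inf
    for perm in permutations(range(m)):
        Bp = B[:, list(perm)]
        # For each column pick the sign theta_i that gives the smaller error.
        err = np.minimum(np.linalg.norm(A - Bp, axis=0),
                         np.linalg.norm(A + Bp, axis=0))
        best = min(best, float(err.max()))
    return best


def are_eps_close(A, B, eps):
    """True when A and B are column-wise eps-close."""
    return columnwise_distance(A, B) <= eps
```

A dictionary and a copy of it with shuffled, sign-flipped columns are at distance zero under this measure.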

Theorem V.1 ([16], Theorem 1.4). There is a polynomial time algorithm to learn a $\mu$-coherent dictionary $A$ from
random samples. With high probability, the algorithm returns a dictionary $\hat{A}$ that is column-wise $\epsilon$-close to $A$
given random samples of the form $y = AX$, where $X$ is drawn from a distribution in class $\Gamma$. Specifically, if
$k \le c \min(m^{(\ell-1)/(2\ell-1)}, 1/(\mu \log d))$ and the distribution has bounded $\ell$-wise moments, where $c > 0$ is a constant only
depending on $\ell$, then the algorithm requires $n = \Omega((m/k)^{\ell-1} \log m + mk^2 \log m \log 1/\epsilon)$ samples and runs in time
$O(n^2 d)$.

Summary of the algorithm of [16] This algorithm, which has fundamental similarities with a concurrent work
[17], consists of two main stages: i) Data Clustering: the connection graph is built where each node corresponds
to a column of $Y$ and an edge between $y_i$ and $y_j$ implies their supports $S_i$ and $S_j$ have a non-empty intersection.
Then, an overlapping clustering procedure is performed over the connection graph to find overlapping maximal
cliques (with missing edges). ii) Dictionary Recovery: every cluster in the connection graph represents the set of
samples associated with a single dictionary atom. After finding these clusters in the connection graph, each atom
is approximated by the principal eigenvector of the covariance matrix for the data samples in its corresponding
cluster.

$^6$ The basin of attraction has a swath of $O(k^2)$ [15].
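
The two stages summarized above can be illustrated with a heavily simplified sketch: edges are declared by thresholding absolute inner products at $1/2$ (the threshold appearing in the lemma used for the connection graph), and an atom is approximated by the principal eigenvector of a cluster's empirical covariance. The overlapping clustering step itself is omitted here and the cluster is assumed to be given; the function names are hypothetical.

```python
import numpy as np


def connection_graph(Y, threshold=0.5):
    """Boolean adjacency matrix: edge (i, j) when |<y_i, y_j>| exceeds the threshold."""
    G = np.abs(Y.T @ Y) > threshold
    np.fill_diagonal(G, False)
    return G


def principal_direction(Y, cluster):
    """Principal eigenvector of the empirical covariance of the columns in `cluster`."""
    Yc = Y[:, cluster]
    Sigma = Yc @ Yc.T / len(cluster)
    _, V = np.linalg.eigh(Sigma)  # eigh returns eigenvalues in ascending order
    return V[:, -1]
```

With an orthonormal dictionary and coefficients of magnitude one, an edge can only appear between samples whose supports intersect, and the principal direction of a cluster built around one atom is dominated by that atom.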

There are two challenges in extending the above result to the BCS framework: i) during generation of the 
connection graph from data and ii) during computation of the principal eigenvector of the data covariance matrix. 
We address these challenges separately in the following subsections. 

A. Building the data connection graph for BCS 

For building the connection graph, we use the fixed part of the hybrid measurements, i.e. $FY$ with $F \in \mathbb{R}^{p_f \times d}$
drawn from a Gaussian distribution. Computation of the connection graph in [16] relies on the following lemma.

Lemma V.1 ([16], Lemma 2.2). Suppose $k < 1/(C'\mu \log d)$ for large enough $C'$ (depending on $C$ in the definition
of $\Gamma$). Then, if $S_i$ and $S_j$ are disjoint, with high probability $|\langle y_i, y_j \rangle| < 1/2$.

Without going into the details of the clustering algorithm of [16], we study the conditions under which the
connection graph does not change when only $p_f$ linear measurements from each data sample are given. Let $FA \in
\mathbb{R}^{p_f \times m}$ be $\mu_f$-coherent. It is not hard to see from the above lemma that if $k < 1/(C'\mu_f \log d)$, then with high
probability for disjoint $S_i$ and $S_j$, $|\langle Fy_i, Fy_j \rangle| < 1/2$. To establish a relationship between $\mu_f$, $\mu$ and $p_f$, we use
the following result from [22].

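Lemma V.2 ([22], Lemma 3.1). Let $x, y \in \mathbb{R}^d$ with $\|x\|_2, \|y\|_2 \le 1$. Assume $\Phi \in \mathbb{R}^{n \times d}$ is a random matrix with
independent $\mathcal{N}(0, 1/n)$ entries. Then, for all $t > 0$

$$\mathbb{P}\left[\,|\langle \Phi x, \Phi y \rangle - \langle x, y \rangle| \ge t\,\right] \le 2\exp\left(-n\,\frac{t^2}{C_1 + C_2 t}\right)$$

with $C_1 = \frac{8e}{\sqrt{6\pi}} \approx 5.0088$ and $C_2 = \sqrt{8}\,e \approx 7.6885$.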
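
The concentration bound can be compared against a quick Monte Carlo estimate. The sketch below assumes the tail bound has the form $2\exp(-n t^2/(C_1 + C_2 t))$ with $C_1 = 8e/\sqrt{6\pi}$ and $C_2 = \sqrt{8}e$, which is the reading consistent with the numerical values 5.0088 and 7.6885 quoted in the lemma; the function names are illustrative.

```python
import numpy as np

C1 = 8 * np.e / np.sqrt(6 * np.pi)  # approximately 5.0088
C2 = np.sqrt(8) * np.e              # approximately 7.6885


def tail_bound(n, t):
    """Upper bound on P[|<Phi x, Phi y> - <x, y>| >= t] for n Gaussian rows."""
    return 2.0 * np.exp(-n * t ** 2 / (C1 + C2 * t))


def empirical_tail(x, y, n, t, trials, rng):
    """Monte Carlo estimate of the same probability."""
    d = x.size
    hits = 0
    for _ in range(trials):
        Phi = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))
        hits += int(abs((Phi @ x) @ (Phi @ y) - x @ y) >= t)
    return hits / trials
```

In such experiments the empirical frequency is typically far below the bound, which is expected since the constants are not tight.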

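Corollary V.1. Assume $F \in \mathbb{R}^{p_f \times d}$ has i.i.d. entries from $\mathcal{N}(0, 1/p_f)$. Let $A$ be $\mu$-coherent and $FA$ be $\mu_f$-coherent.
Then,

$$\mathbb{P}[\mu_f \ge \mu + t] \le 2\exp\left(-p_f\,\frac{t^2}{C_1 + C_2 t}\right)$$

with $C_1$ and $C_2$ specified in Lemma V.2.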

Proof: Note that the variance of $F$'s entries does not have an effect on $\mu_f$ due to the normalization in the
definition of the coherency and we could assume $F$'s entries have variance $1/d$ as before. We exploit Lemma V.2
by replacing $x = a_i$, $y = a_j$ and $\Phi = F$. The proof is complete by noticing that $\mathbb{P}[\mu_f \ge \mu + t] \le \mathbb{P}[|\mu_f - \mu| \ge t]$.
■

Based on Corollary V.1, it can be deduced that with high probability $\mu_f$ exceeds $\mu$ by at most a small deviation
that shrinks as $p_f$ grows. Therefore, replacing
$k < c/(\mu \log d)$ in the original Theorem V.1 with $k < c/(\mu_f \log d)$ introduces a slightly stronger sparsity requirement
for the success of the algorithm.


B. Dictionary estimation for BCS 

At this stage, we only exploit the varying part of the measurements $\mathcal{M}_\Phi(Y)$ and use $p$ in place of $p_v$ for simplicity. Let $C_1, C_2, \ldots, C_m$ be the $m$ discovered overlapping clusters from the previous stage and define the empirical covariance matrix $\Sigma_i = \frac{1}{|C_i|}\sum_{j \in C_i} y_j y_j^T$ for the cluster $i$. The SVD approach⁷ of [16] estimates $a_i^*$ by $a_i$, which is the principal eigenvector⁸ of $\Sigma_i$. Let

$$\hat\Sigma_i = \frac{1}{|C_i|}\sum_{j \in C_i} \hat y_j \hat y_j^T$$

denote the empirical covariance matrix resulting from the compressive measurements, where $\hat y_j = \Phi_j^T(\Phi_j\Phi_j^T)^{-1}\Phi_j y_j$ as before. Similarly, let $\hat a_i$ denote the principal eigenvector of $\hat\Sigma_i$. Our goal in this section is to show that $\|\hat a_i - a_i\|_2$ is bounded by a small constant for finite $n$ and approaches zero for large $n$. For this purpose, we use recent results from the area of subspace learning, specifically, subspace learning from compressive measurements [18]. A critical factor in the estimation accuracy of the principal eigenvector of a perturbed covariance matrix is the eigengap between the principal and the second eigenvalues of the original covariance matrix. This is a well-known result from the works of Chandler Davis and William Kahan, known as the Davis-Kahan sine theorem [19].

⁷ In fact, [16] proposes two methods for dictionary estimation: i) selective averaging and ii) the SVD-based approach. We chose to work with the SVD approach due to its more abstract and versatile nature.

⁸ The principal eigenvector is equivalent to the first singular vector of the covariance matrix.
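The role of the principal eigenvector and of the eigengap can be illustrated on synthetic data. This is only a sketch under simplifying assumptions: a single cluster generated by one atom `a_star` with $\pm 1$ coefficients plus small Gaussian noise, and uncompressed samples. The helper `principal_eigvec` and all parameter values are hypothetical.

```python
import numpy as np

def principal_eigvec(samples):
    """samples: d x n array. Returns the unit principal eigenvector of
    (1/n) * sum_j y_j y_j^T together with all eigenvalues (ascending)."""
    S = samples @ samples.T / samples.shape[1]
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh sorts eigenvalues in ascending order
    return eigvecs[:, -1], eigvals

rng = np.random.default_rng(1)
d, n = 20, 500
a_star = rng.standard_normal(d)
a_star /= np.linalg.norm(a_star)                      # hypothetical generating atom
coeffs = rng.choice([-1.0, 1.0], size=n)              # nonzero coefficient on this atom
Y = np.outer(a_star, coeffs) + 0.05 * rng.standard_normal((d, n))

a_hat, eigvals = principal_eigvec(Y)
if a_hat @ a_star < 0:                                # resolve the sign ambiguity
    a_hat = -a_hat
err = np.linalg.norm(a_hat - a_star)
gap = eigvals[-1] - eigvals[-2]                       # empirical eigengap
print(f"error = {err:.4f}, eigengap = {gap:.4f}")
```

In this toy setting the eigengap is close to one, and the estimation error is correspondingly small, in line with the dependence on the eigengap described above.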

Consider the following notation. Let $\Pi_k$ and $\hat\Pi_k$ denote projection operators onto the principal $k$-dimensional subspaces of $\Sigma$ and $\hat\Sigma$ respectively (i.e. the projection onto the top-$k$ eigenvectors). Let $\|\hat\Pi_k - \Pi_k\|_2$ denote the spectral norm of the difference between $\Pi_k$ and $\hat\Pi_k$. Define the eigengap $\gamma_k$ as the distance between the $k$'th and $(k+1)$'st largest eigenvalues of $\Sigma$. Suppose $\hat\Sigma$ is computed from at least $\ell$ data samples ($|C_i| \ge \ell$ for all $i$). Moreover, assume the data samples have bounded $\ell_2$ norms, i.e. $\forall j \in [\ell]: \|y_j\|_2 \le \eta$ for some positive $\eta \in \mathbb{R}$.

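Lemma V.3 ([18], Theorem 1). With probability at least $1 - \delta$,

$$\|\hat\Pi_k - \Pi_k\|_2 \le \frac{1}{\gamma_k}\left(\sqrt{\frac{88\,\eta^2 d^2}{\ell p^2}\log(d/\delta)} + \frac{8\,\eta d^2}{3\,\ell p^2}\log(d/\delta)\right)$$

so that one can achieve $\|\hat\Pi_k - \Pi_k\|_2 \le \epsilon$ provided that

$$\ell \ge \max\left\{\frac{352\,\eta^2 d^2}{p^2\gamma_k^2\epsilon^2},\ \frac{16\,\eta d^2}{3\,\gamma_k\,\epsilon\,p^2}\right\}\cdot\log(d/\delta)$$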


Below, we present a customization of Lemma V.3 for the $\ell_2$ error of the principal eigenvector estimator.

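Corollary V.2. Let $a_i$ and $\hat a_i$ represent the principal eigenvectors of $\Sigma_i$ and $\hat\Sigma_i$ respectively. With probability at least $1 - \delta$, for all $i \in [m]$,

$$\|\hat a_i - a_i\|_2 \le 2\,\|\hat\Pi_1 - \Pi_1\|_2,$$

where the right-hand side obeys the bound of Lemma V.3 with $k = 1$.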




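Proof: Clearly, $\hat\Pi_1 = \hat a_i \hat a_i^T$ and $\Pi_1 = a_i a_i^T$. As we mentioned in the definition of $\epsilon$-closeness, the sign of the estimate is implicit in the error expression $\|\hat a_i - a_i\|_2$, requiring that $\|\hat a_i - a_i\|_2 \le \|\hat a_i + a_i\|_2$ and consequently $\langle \hat a_i, a_i \rangle \ge 0$. Also note that, by definition, for any $z \in \mathbb{R}^d$,

$$\frac{\|(\hat\Pi_1 - \Pi_1)z\|_2}{\|z\|_2} \le \|\hat\Pi_1 - \Pi_1\|_2.$$

Now let $z = \hat a_i + a_i$. Then

$$\frac{\|(\hat\Pi_1 - \Pi_1)z\|_2}{\|z\|_2} = (1 + \langle a_i, \hat a_i \rangle)\,\frac{\|\hat a_i - a_i\|_2}{\|\hat a_i + a_i\|_2} \ge \frac{1}{2}\,\|\hat a_i - a_i\|_2,$$

where the last step uses $\|\hat a_i + a_i\|_2^2 = 2(1 + \langle a_i, \hat a_i \rangle)$ together with $\langle a_i, \hat a_i \rangle \ge 0$. Therefore

$$\|\hat a_i - a_i\|_2 \le 2\,\|\hat\Pi_1 - \Pi_1\|_2$$

and the rest follows from Lemma V.3. ■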

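The inequality between the eigenvector error and the projector error can be sanity-checked numerically. The sketch below draws random pairs of unit vectors with non-negative inner product (the sign convention used in the proof) and records the worst observed ratio of $\|\hat a - a\|_2$ to $2\|\hat\Pi_1 - \Pi_1\|_2$. The function name `check_bound` and the perturbation level are illustrative assumptions.

```python
import numpy as np

def check_bound(trials=1000, d=10, seed=0):
    """Worst observed ratio ||a_hat - a||_2 / (2 * ||P_hat - P||_2) over random pairs."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        a = rng.standard_normal(d)
        a /= np.linalg.norm(a)
        a_hat = a + 0.5 * rng.standard_normal(d)   # perturbed copy of a
        a_hat /= np.linalg.norm(a_hat)
        if a @ a_hat < 0:                          # enforce <a, a_hat> >= 0
            a_hat = -a_hat
        lhs = np.linalg.norm(a_hat - a)
        # ord=2 on a matrix gives the spectral norm
        rhs = 2.0 * np.linalg.norm(np.outer(a_hat, a_hat) - np.outer(a, a), 2)
        worst = max(worst, lhs / rhs)
    return worst

worst = check_bound()
print(f"worst ratio = {worst:.4f}")
```

A ratio that never exceeds one is consistent with the corollary; the check is empirical and does not replace the proof.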
To obtain a lower bound for the eigengap $\gamma_1$, we need to review some of the intermediate results in [16]. In fact, we compute a lower bound for the eigengap of the expected covariance matrix, which serves as a close approximation of $\gamma_1$ when the number of data samples $\ell$ is large. For every $i \in [m]$, let $\Gamma_i$ be the distribution conditioned on $x_i \neq 0$. Let $\alpha = |\langle u, a_i^* \rangle|$ for any unit-norm $u$ and let

$$R_i^2 = E_{\Gamma_i}[\langle a_i^*, y \rangle^2] = 1 + \sum_{j \neq i} \langle a_i^*, a_j^* \rangle^2\, E_{\Gamma_i}[x_j^2]$$











denote the projected variance of $\Gamma_i$ on the direction $u = a_i^*$. It is shown in [16] that generally

$$E_{\Gamma_i}[\langle u, y \rangle^2] \le \alpha^2 R_i^2 + 2\alpha\sqrt{1 - \alpha^2}\,\zeta + (1 - \alpha^2)\,\zeta^2$$

where £ = max{^, y^J}. 

The principal eigenvector of $\Sigma_i$ can be computed by finding the unit-norm $u$ that maximizes $E_{\Gamma_i}[\langle u, y \rangle^2]$. Meanwhile, it has been established that for $u = a_i^*$, $E_{\Gamma_i}[\langle u, y \rangle^2] = R_i^2$. Therefore, the range of $\alpha$ for the principal eigenvector must satisfy the inequality (for $\alpha \le 1$)

$$R_i^2 \le \alpha^2 R_i^2 + 2\alpha\sqrt{1 - \alpha^2}\,\zeta + (1 - \alpha^2)\,\zeta^2$$


It is not difficult to show this range is

$$\alpha \in \left[\frac{R_i^2 - \zeta^2}{\sqrt{4\zeta^2 + (R_i^2 - \zeta^2)^2}},\ 1\right].$$
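The stated range for $\alpha$ can be verified numerically: at the lower endpoint the inequality holds with equality, it holds strictly between the endpoint and one, and it fails below the endpoint. The following sketch assumes $R_i^2 > \zeta^2$; the function names and the sample values of $R_i^2$ and $\zeta$ are arbitrary illustrations.

```python
import math

def alpha_lower_bound(R2, zeta):
    """Lower endpoint (R^2 - zeta^2) / sqrt(4 zeta^2 + (R^2 - zeta^2)^2)."""
    gap = R2 - zeta ** 2
    return gap / math.sqrt(4.0 * zeta ** 2 + gap ** 2)

def variance_upper_bound(alpha, R2, zeta):
    """Right-hand side of the projected-variance inequality for a given alpha."""
    s = math.sqrt(1.0 - alpha ** 2)
    return alpha ** 2 * R2 + 2.0 * alpha * s * zeta + (1.0 - alpha ** 2) * zeta ** 2

R2, zeta = 1.2, 0.1                      # arbitrary illustrative values
a0 = alpha_lower_bound(R2, zeta)
at_endpoint = variance_upper_bound(a0, R2, zeta)                 # equals R2
inside = variance_upper_bound(0.5 * (a0 + 1.0), R2, zeta)        # at least R2
outside = variance_upper_bound(0.9 * a0, R2, zeta)               # below R2
print(a0, at_endpoint, inside, outside)
```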

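Now, for the second eigenvector and eigenvalue pair we must find a unit-norm $v$ that satisfies $\langle v, u \rangle = 0$ and maximizes $E_{\Gamma_i}[\langle v, y \rangle^2]$. Define $\beta = |\langle v, a_i^* \rangle|$. It can be shown that

$$\beta \in \left[-\frac{2\zeta}{\sqrt{4\zeta^2 + (R_i^2 - \zeta^2)^2}},\ \frac{2\zeta}{\sqrt{4\zeta^2 + (R_i^2 - \zeta^2)^2}}\right]$$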


Note that the first and the second largest eigenvalues correspond to the projected variances of $\Gamma_i$ on the directions of $u$ and $v$, respectively. Therefore, based on the derived ranges for $\alpha$ and $\beta$, we are able to find the following lower bound for $\gamma_1$:


7 i > Rl - 


me 

R i - c 2 


Note that $\zeta$ becomes very small as the problem size $(d, m, n)$ becomes large, so that the eigengap of both the expected and the empirical covariance matrices is approximately one. Therefore, given a sufficient number of samples, it can be guaranteed that $\hat a_i$ is an accurate estimate of $a_i$ and, in turn, an accurate estimate of $a_i^*$ even when only $p < d$ measurements per sample are available. Once the dictionary has been approximated to within a close distance from the optimal dictionary $A^*PD$, iterative algorithms such as [7], [8], [9], [10] can assure convergence to a local optimum and therefore perfect recovery, as suggested in [15], [16], [17]. Finally, perfect recovery of the dictionary results in perfect recovery of $X$ and $Y$ given the CS bounds for the number of measurements [1], which are generally weaker than the stated bounds for the recovery of the dictionary.


VI. Conclusion 

In this work, we studied the conditions for perfect recovery of both the dictionary and the sparse coefficients 
from linear measurements of the data. The first part of this work brings together some of the recent theories 
about the uniqueness of dictionary learning and the blind compressed sensing problem. Moreover, we described a 
‘hybrid’ random measurement scheme that reduces the theoretical bounds for the minimum number of data samples 
to guarantee a unique dictionary and thus perfect recovery for blind compressed sensing. In the second part, we 
discussed the algorithmic aspects of dictionary learning under random linear measurements. It was shown that a 
polynomial-time algorithm can assure convergence to the generative dictionary given a sufficient number of data 
samples with high probability. It would be interesting to explore dictionary learning and blind compressed sensing 
for non-Gaussian random measurements. In particular, when the data matrix is partially observed (i.e. an incomplete 
matrix), data recovery becomes a matrix completion problem where the elements of the data matrix are assumed 
to lie in a union of interconnected rank-$k$ subspaces. This is a subject of future work.

Appendix 

A. Proof of Lemma III.1

Let $X'' := PDX'$. Note that $A'X' = APDX' = AX''$. Thus, $\mathcal{M}_p^G(AX'') = \mathcal{M}_p^G(AX)$. Our goal is to show $X'' = X$ and thus $A'X' = AX'' = AX$. To prove $X'' = X$, we must show that for every $j \in [n]$, $\Phi_j A x''_j = \Phi_j A x_j$ results in $x''_j = x_j$ with probability one. For simplicity, we omit the sample index $j$ in the rest of the proof.













Let $S$ and $S''$ respectively denote the sets of non-zero indices of $x$ and $x''$, where $|S|, |S''| \le k$. Rewrite $\Phi A x'' = \Phi A x$ as $\Phi A(x'' - x) = 0$. Note that $x'' - x$ is supported on $T = S \cup S''$ where $|T| \le 2k$. Therefore, we must show that, with probability one,

$$\mathrm{rank}(\Phi A_T) = |T|$$

necessitating $x'' - x = 0$ or $x'' = x$. Since $\mathrm{Spark}(A) > 2k$, every $2k$ columns of $A$ are linearly independent and we are able to perform a Gram-Schmidt orthogonalization on $A_T$ to get $A_T = UV$ where $U \in \mathbb{R}^{d \times 2k}$ is orthonormal ($d \ge 2k$) and $V$ is a full-rank square matrix. Hence, $\Phi U \in \mathbb{R}^{p \times 2k}$ is distributed according to i.i.d. Gaussian and is full-rank with probability one [21]. We conclude the proof by noticing that $\mathrm{rank}(\Phi U V) = \mathrm{rank}(\Phi U) = 2k$ since $V$ is a full-rank square matrix. ■
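The rank argument can be illustrated with a small numerical experiment. This sketch assumes a random Gaussian dictionary (for which any $2k$ columns are linearly independent with probability one) and a Gaussian $\Phi$ with variance $1/d$; the function name `rank_on_support` and the sizes are hypothetical.

```python
import numpy as np

def rank_on_support(d=30, m=60, k=3, p=8, seed=0):
    """Rank of Phi @ A_T for a random support T of size 2k."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, m))                    # stand-in dictionary
    T = rng.choice(m, size=2 * k, replace=False)       # union of two k-sparse supports
    Phi = rng.standard_normal((p, d)) / np.sqrt(d)     # Gaussian measurement matrix
    return int(np.linalg.matrix_rank(Phi @ A[:, T]))

print(rank_on_support(p=8))  # p >= 2k: full column rank |T| = 2k is expected
print(rank_on_support(p=4))  # p < 2k: the rank is capped at p, so uniqueness can fail
```

The second call shows why the argument needs at least $2k$ measurements per sample: with fewer rows, $\Phi A_T$ cannot have full column rank.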


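B. Proof of Lemma IV.2

Denote a general linear matrix measurement operator $\mathcal{M}: \mathbb{R}^{d \times n} \to \mathbb{R}^{\tau}$ such that $\mathcal{M}(Y) = \zeta = [\zeta_1, \zeta_2, \ldots, \zeta_\tau]^T$, $\zeta_i = \langle M_i, Y \rangle$ for $i \in [\tau]$. If we denote

$$\Phi = \begin{bmatrix} \mathrm{vec}(M_1)^T \\ \mathrm{vec}(M_2)^T \\ \vdots \\ \mathrm{vec}(M_\tau)^T \end{bmatrix} \in \mathbb{R}^{\tau \times dn} \qquad (3)$$

then $\mathcal{M}(Y) = \Phi\,\mathrm{vec}(Y)$. Specifically, under the Gaussian measurement scheme for BCS, we have:

$$\Phi_{CS} = \begin{bmatrix} \Phi_1 & & \\ & \ddots & \\ & & \Phi_n \end{bmatrix} \in \mathbb{R}^{pn \times dn} \qquad (4)$$

where non-zero entries of $\Phi_{CS}$ are i.i.d. Gaussian with mean zero and variance $1/d$.

The following result from [13] gives the required number of linear measurements to guarantee (with probability one) that a rank-$r$ matrix does not fall into the null-space of the measurement operator.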

Lemma VI.1 ([13], Theorem 3.1). Let $\mathcal{R}$ be a $q$-dimensional continuously differentiable manifold over the set of $d \times d$ real matrices. Suppose we take $\tau \ge q + 1$ linear measurements from $Y \in \mathcal{R}$. Assume there exists a constant $C = C(d)$ such that $P(|\langle M_i, Y \rangle| < \epsilon) < C\epsilon$ for every $Y$ with $\|Y\|_F = 1$. Further assume that for each $Y \neq 0$ the random variables $\{\langle M_i, Y \rangle\}$ are independent. Then with probability one, $\mathrm{Null}(\mathcal{M}) \cap \mathcal{R} \setminus \{0\} = \emptyset$.

A careful inspection of the derivation of the above theorem in [13] reveals that this result can be extended to include the manifolds over the set of rectangular matrices $Y \in \mathbb{R}^{d \times n}$. Specifically, for the manifold over rank-$r$ $d \times n$ matrices we have (see [14] for example) $q = \dim(\mathcal{R}) = r(d + n - r)$.

The following lemma establishes a sufficient lower bound for $\tau$ to guarantee that $\mathcal{M}(A) = \mathcal{M}(B)$ results in $A = B$.


Lemma VI.2. Let $\mathcal{R}$ denote the manifold over the set of rank-$r$ $d \times n$ matrices and let $\mathcal{R}'$ denote the manifold over the set of rank-$2r$ $d \times n$ matrices. Also let $\mathcal{M}: \mathbb{R}^{d \times n} \to \mathbb{R}^{\tau}$ with $\tau \ge \dim(\mathcal{R}') + 1 = 2r(d + n - 2r) + 1$. Then, for any $A, B \in \mathcal{R}$, $\mathcal{M}(A) = \mathcal{M}(B)$ implies $A = B$ with probability one.

Proof: Clearly, $\tau > \dim(\mathcal{R}')$ implies $\tau > \dim(\mathcal{R}'')$ for any $\mathcal{R}''$ over the set of rank-$r''$ $d \times n$ matrices with $r'' \le 2r$. Also note that $\mathrm{rank}(A - B) \le 2r$, thus $A - B \in \mathcal{R}''$. Now, since $\mathcal{M}(A - B) = 0$ and $\mathrm{Null}(\mathcal{M}) \cap \mathcal{R}'' \setminus \{0\} = \emptyset$ (with probability one, according to Lemma VI.1), we must have $A - B = 0$ or $A = B$ with probability one. ■
It only remains to show that the Gaussian BCS measurement operator satisfies the requirements of Lemma VI.1. As noted in [13], the requirement $P(|\langle M_i, Y \rangle| < \epsilon) < C\epsilon$ requires that the densities of $\langle M_i, Y \rangle$ do not spike at the origin; a sufficient condition for this to hold for every $Y$ with $\|Y\|_F = 1$ is that each $M_i$ has i.i.d. entries with a continuous density. Note that the non-zero entries of $\Phi_{CS}$ are i.i.d. Gaussian and cover every column in $Y$. Therefore, none of the entries of $\Phi_{CS}\,\mathrm{vec}(Y)$ would spike at the origin or, equivalently, there exists $C = C(d, n)$ so that $P(|\langle (\Phi_j)_i, y_j \rangle| < \epsilon) < C\epsilon$ with $\|y_j\|_2 = \Theta(1/\sqrt{n})$, given that the vector $(\Phi_j)_i$ is drawn from a continuous distribution.
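The vectorized form of the measurement operator used in this appendix can be checked directly. The sketch below assumes a block-diagonal $\Phi_{CS}$ built from per-sample matrices $\Phi_j$ and a column-stacking $\mathrm{vec}(\cdot)$; it also evaluates the measurement count of Lemma VI.2. The helper names `block_diag_measure` and `min_measurements` and the sizes are hypothetical.

```python
import numpy as np

def block_diag_measure(Phis):
    """Assemble the pn x dn block-diagonal matrix from a list of p x d blocks."""
    p, d = Phis[0].shape
    n = len(Phis)
    Phi_cs = np.zeros((p * n, d * n))
    for j, Phi_j in enumerate(Phis):
        Phi_cs[j * p:(j + 1) * p, j * d:(j + 1) * d] = Phi_j
    return Phi_cs

def min_measurements(r, d, n):
    """Measurement count 2r(d + n - 2r) + 1 from Lemma VI.2."""
    return 2 * r * (d + n - 2 * r) + 1

rng = np.random.default_rng(0)
d, n, p = 6, 4, 3
Y = rng.standard_normal((d, n))
Phis = [rng.standard_normal((p, d)) / np.sqrt(d) for _ in range(n)]

Phi_cs = block_diag_measure(Phis)
vec_Y = Y.reshape(-1, order="F")                       # stack the columns of Y
stacked = np.concatenate([Phis[j] @ Y[:, j] for j in range(n)])
print(np.allclose(Phi_cs @ vec_Y, stacked))            # per-sample measurements agree
print(min_measurements(r=1, d=d, n=n))
```

The block-diagonal layout is what makes each measurement touch a single column of $Y$, which is the structure the argument about continuous densities relies on.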












C. Proof of Lemma IV.4

Let $r = \mathrm{rank}(Y_J)$ and $k = \mathrm{rank}(FY_J)$. Perform a Gram-Schmidt orthogonalization on $Y_J$ to obtain $Y_J = UV$ where $U \in \mathbb{R}^{d \times r}$ has orthonormal columns and $V \in \mathbb{R}^{r \times |J|}$ is full-rank; hence, given $r \le |J|$, we have $k = \mathrm{rank}(FUV) = \mathrm{rank}(FU)$. Note that, since $U$ has orthonormal columns and $F$ is i.i.d. Gaussian, $FU$ is also i.i.d. Gaussian. Hence, with probability one, $FU$ is full-rank [21] and $k = \min(p_f, r)$. To conclude the proof, note that when $k < p_f$, necessarily we have $k = r$.
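The rank identity $k = \min(p_f, r)$ can be observed numerically. This is a sketch under the assumption that $Y_J$ is a generic rank-$r$ matrix and $F$ is i.i.d. Gaussian; the function name `rank_after_projection` and the sizes are illustrative.

```python
import numpy as np

def rank_after_projection(d, r, cols, p_f, seed=0):
    """Return (rank(Y_J), rank(F @ Y_J)) for a random rank-r Y_J and Gaussian F."""
    rng = np.random.default_rng(seed)
    Y_J = rng.standard_normal((d, r)) @ rng.standard_normal((r, cols))  # rank r
    F = rng.standard_normal((p_f, d))
    return int(np.linalg.matrix_rank(Y_J)), int(np.linalg.matrix_rank(F @ Y_J))

print(rank_after_projection(d=20, r=4, cols=10, p_f=6))  # p_f > r: rank preserved
print(rank_after_projection(d=20, r=4, cols=10, p_f=3))  # p_f < r: rank capped at p_f
```

When the projected rank is strictly below $p_f$, it therefore reveals the true rank $r$ of the cluster, which is the property the lemma exploits.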
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